Week 03: Parsing & Top-Down Parsing

What is parsing?

  • Goal: To derive the parse tree of program in grammar.
  • Input: Sequence of tokens from lexer
  • Output: Parse tree of the program.
  • Sample
    • Input: IF ID = ID THEN INT ELSE INT FI
    • Output: (IF-THEN-ELSE (= ID ID) (INT) (INT))

Context-Free Grammars

  • Goal: To separate the token string into many pieces such that every piece is acceptable in grammar.
    • Context-Free means every two pieces are not related.
    • Parser must distinguish between valid and invalid strings of tokens under CFGs.
  • What we need
    • A language for describing valid strings of tokens.
    • A method for distinguishing valid under the language.
  • Note: Programming languages always have recursive structures.
    • An EXPR is ... (recursive)
      • if EXPR then EXPR else EXPR fi
    • Context-free grammars are a natural notation for this recursive structure.
  • Def. A CFG consists of
    • A set of terminals $T$
    • A set of non-terminals $N$
    • A start symbol $S \in N$
    • A set of productions $X \rightarrow Y_1, \dots, Y_N, X\in N, Y_i \in N\cup T\cup \{ \epsilon\}$
      • Ex: $S\rightarrow (S)$
      • How the production works?
        • Begin with a string with only the start symbol $S$
        • Replace any non-terminal $X$ in the string by the right-hand side of some production $X \rightarrow Y_1\dots Y_n$
        • Repeat (2) until there are no non-terminals
          • $X_1\dots X_i X X_{i+1}\dots X_N \rightarrow X_1\dots X_i Y_1\dots Y_k X_{i+1}\dots X_N,\ if\ X\rightarrow Y_1\dots Y_k$
          • $S \rightarrow \dots \rightarrow \alpha_0 \rightarrow \alpha_1 \rightarrow \dots \rightarrow \alpha_n, if\ \alpha_0 \rightarrow^* \alpha_n (\geq 0\ steps)$
  • Def. (Language) Use grammar to define language.
    • Let $G$ be a context-free grammar with start symbol $S$. Then the language $L(G)$ of $G$ is $$\{ a_1\dots a_n\ |\ \forall i\ (a_i\in T)\wedge (S\rightarrow^* a_1\dots a_n) \}$$
    • Terminals ought to be tokens of the language.
    • Membership in a language is yes or no; also need parse tree of the input.
    • Must handle errors gracefully.
    • Need an implementation of CFG’s (e.g., bison)
    • Form of the grammar is important.
      • Many grammars generate the same language.
      • Tools are sensitive to the grammar.

Derivations

  • Goal: To define the process from the start symbol to some valid token string.
    • Def. A derivation is a sequence of productions applying on a start symbol $S$ to sequences without any non-terminal.
    • We can draw a derivation as a tree
      • Start symbol is the tree’s root
      • For a production $X \rightarrow Y_1\dots Y_n$ add children $Y_1\dots Y_n$ to node $X$.
    • Sample
      • $G$: $E\rightarrow E+E\ |\ E*E\ |\ (E)\ |\ id$
      • Target: $id*id+id$
      • Derivation: $E \rightarrow E+E \rightarrow E*E+E \rightarrow id*E+E \rightarrow id*id+E \rightarrow id*id+id$
      • Draw derivation as a parse tree: $(((id) * (id)) + (id))$
      • The example is a left-most derivation. At each step, replace the left-most non-terminal.
      • Note that right-most and left-most derivations have the same parse tree.
  • A parse tree has
    • Terminals at the leaves
    • Non-terminals at the interior nodes
  • An in-order traversal of the leaves is the original input.
  • We are not just interested in whether $s \in L(G)$.
    • We need a parse tree for $s$.
  • The relation between derivation and parse tree are many to one.
    • ((id)+(id))

Ambiguity

  • Goal: A parse tree is used to describe how we explain the program string. If there are two parse trees for the program, it has two meanings.
  • Sample
    • $G$: $E\rightarrow E+E\ |\ E*E\ |\ (E)\ |\ id$
    • Target: $id*id+id$
    • This string has two parse trees
      • Derivation: $E \rightarrow E+E \rightarrow E*E+E \rightarrow id*E+E \rightarrow id*id+E \rightarrow id*id+id$
      • Derivation: $E \rightarrow E*E \rightarrow E*E+E \rightarrow id*E+E \rightarrow id*id+E \rightarrow id*id+id$
  • Def. A grammar is ambiguous, if it has more than one parse tree for some string $s\in L(G)$.
    • A lot of ways to solve it.
    • Most direct method is to rewrite grammar unambiguously.
    • Like, enforces precedence of * over +.
    • Impossible to convert automatically an ambiguous grammar to an unambiguous one.
    • Most tools allow precedence and associativity declarations to disambiguate grammars.

Error

Error

  • Goal
    • To detect non-valid programs
    • To translate the valid ones
    • Sample
      • (Lexer) Errors in Lexical: ...$...
      • (Parser) Errors in Syntax: ...x *%...
      • (Type checker) Errors in Semantic: ...int x; y=x(3);...
      • (Tester/User) Errors in Correctness: program

Error Handler

  • Goal
    • Report errors accurately and clearly
    • Recover from an error quickly
    • Not slow down compilation of valid code
  • What to do
    1. Panic mode
      • Discard tokens until one with a clear role is found. And continue.
      • Bison: use the special terminal error to describe how much input to skip.
    2. Error Productions
      • Specify known common mistakes in the grammar, like 5x and 5*x...
      • Complicates the grammar.
    3. Automatic local or global correction
      • Try token insertions and deletions. (Edit distance)
      • Exhaustive search.

Abstract Syntax Trees (AST)

  • Goal: Represent the parse tree and ignore the details.
  • Form: (op (op1) (op2),...,(opn))
    • (IF-THEN-ELSE (< (x) (5)) (= (y) (4)) (= (y) (3)))

Top-Down Parsing

  • Recursive Descent Parsing
    • The parse tree is constructed
      • From the top
      • From left to right
    • Use DFS to match the token string
      • Let the global pointer next point to the next input token: TOKEN *next = array;
      • Try to match a given token terminal: bool term(TOKEN tok) { return *next++ == tok; }
      • Try to mathc the $n^{th}$ production of S: bool Sn() { ... }
      • For production $E\rightarrow T$: bool E1() { return T(); }
      • For production $E\rightarrow T+E$: bool E2() { return T() && term(PLUS) && E(); }
      • For all productions of E (with backtracking) bool E() { TOKEN *save = next; return (next = save, E1()) || (next = save, E2()); }
  • Error in Recursive Descent
    • Scenario: $S \rightarrow Sa$
      • It goes into an infinite loop.
    • RD does not work in left-recursive grammar.
      • Def. A left-recursive grammar has a non-terminal $S$, $S\rightarrow^+ S\alpha$, for some $\alpha$
      • Rewrite all production rules to use right-recursion.

In [ ]: